1 Introduction

Within mass customization, manufacturing companies face the challenge of combining productivity increases with high quality standards. Quality controls are carried out after selected production steps to achieve the required quality standards. They represent measuring points within production for checking quality-relevant product properties. Today, quality control is often a visual inspection by trained personnel using checklists. For example, in the assembly of small electronic products, a worker checks that the necessary screws have been fitted and that the component has been properly assembled. These manual checks are time-consuming and cost-intensive. They are also characterized by a high degree of monotony in the form of repetitive processes with a fixed sequence and are prone to errors [1].

Visual inspections in particular offer a high potential for automation. Advances in deep learning open up innovative possibilities for industrial production. Automated visual inspections in production can be carried out using machine vision. The quality gates consist of one or several cameras, a light source, a trigger, the production line control and image processing software [2]. However, training the image processing software using deep learning methods requires large amounts of data in order to perform well.

The basis for a successful implementation of deep learning methods is high-quality annotated training data [2]. Current approaches use extensive amounts of annotated real image data for this purpose. The collection of image datasets in particular is one of the most time-consuming and cost-intensive steps in the implementation of deep learning methods [3]. Synthetically generated training image data provide a remedy. Training with synthetic data offers many advantages for automated quality control, especially in mass customization with a high number of variants, since unlimited amounts of data can be produced [4]. Within object detection, approaches to apply synthetic data have been developed in recent years (see [5, 6, 7]). However, the application of synthetic data for machine vision quality gates in automation is currently a research gap.

The aim of this paper is to investigate the use of synthetic training data for machine vision quality gates in mass customization. Using CAD and rendering software, three different training datasets are generated: 1) the first training dataset forms the baseline and consists entirely of real image data; 2) the second training dataset contains exclusively synthetic training data; 3) the third training dataset is hybrid, containing 95% synthetic and 5% real image data. For the comparison of the three approaches, Accuracy, Precision, Recall and F1-Score are used as evaluation criteria.

For validation purposes, the assembly of the open-source jointed-arm robot “Zortrax”, which is produced in the Smart Automation Laboratory of the Heinz Nixdorf Institute, serves as an example [7]. Within this work, one assembly step of the Zortrax robot is evaluated using deep learning methods trained on synthetic image data. The research contribution is intended to answer the fundamental question of whether synthetically generated image data is generally suitable for machine vision quality gates.

2 State of the Art

Recently, synthetic data has been applied to improve the performance of object detection algorithms (see [5, 7]). Using synthetic data in the context of industrial production and for training machine vision quality gates still represents a research gap.

2.1 Machine Vision Quality Gates

Machine vision quality gates document the product quality and trace products throughout production. Usually, the quality of products is evaluated after defined production steps that are critical for the product quality. Therefore, assembly stations, machines or production lines are extended with cameras to perform visual quality inspections in order to remove or rework defective products [2].

Performing visual quality inspections requires different capabilities. Some common tasks of machine vision quality gates are [8, 9]: (1) counting pixels, (2) template matching, (3) segmentation, (4) barcode reading, (5) object identification, (6) position detection, (7) completeness checks, (8) shape and dimensional inspections, and (9) surface inspection. The typical hardware components of machine vision quality gates are (see Fig. 7.1; [2]):

  1. Programmable Logic Controller (PLC), which controls the production process

  2. Trigger, which activates the image acquisition process of the camera(s)

  3. Conveyor belt, which mechanically transports the products through production

  4. Illumination of the object using specially designed lighting, ensuring a high image quality

  5. (Smart) camera(s), with which the object is imaged using lenses and light sensors

  6. (Edge) computer, built directly into the camera or standalone, which classifies the acquired image

  7. Machine vision software, which inspects the image and returns an evaluation to the PLC (for example, MVTec Halcon)

Fig. 7.1 Components of machine vision quality gates (based on [9]).

However, depending upon the task, machine vision quality gates may differ slightly. Visual inspections at manual assembly stations usually do not require mechanical transportation, and the triggers are controlled via buttons or foot pedals. A minimal loop for such a quality gate is sketched below.
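The following Python sketch illustrates the interaction of these components as a trigger-acquire-classify loop. It is an illustration only, assuming an OpenCV camera and a pre-trained Keras classifier; the model file name and the trigger abstraction are assumptions, not part of any cited system.

```python
# Minimal quality gate loop: wait for trigger, acquire image, classify,
# report. Illustrative sketch; model file and trigger are assumptions.
import cv2
import numpy as np
from tensorflow.keras.models import load_model

model = load_model("quality_gate_model.h5")  # pre-trained classifier (assumed)
camera = cv2.VideoCapture(0)                 # (smart) camera

def wait_for_trigger() -> None:
    """Placeholder for the PLC/button/foot-pedal trigger signal."""
    input("Press Enter to trigger image acquisition...")

while True:
    wait_for_trigger()
    ok, frame = camera.read()                # image acquisition
    if not ok:
        continue
    # resize and rescale (colour-channel handling omitted for brevity)
    x = cv2.resize(frame, (224, 224)).astype(np.float32) / 255.0
    prob_ok = float(model.predict(x[np.newaxis])[0, 0])
    # In a real gate this evaluation would be returned to the PLC.
    print("PASS" if prob_ok > 0.5 else "FAIL -> remove or rework")
```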

2.2 Synthetic Training Data

Synthetic training data has already been used in computer vision approaches. Applications include object detection [5, 7], classification [6] and segmentation [10, 11]. In these application areas, it is either infeasible to collect and label training data manually or impossible to obtain real data at all. In such cases, training on synthetic data shows great success.

Existing approaches that are trained using real image data go through the following six steps, which are iterated if the key performance indicators are not reached: (1) data acquisition, (2) data cleaning, (3) data annotation, (4) model training, (5) model testing and (6) model deployment (see Fig. 7.2). Within the first step, data are acquired from imaging devices such as a camera (typically captured as images or sequences from videos). Secondly, each image is pre-processed for the purpose of standardization; this includes resizing, blurring, rotating, etc. Thirdly, the pre-processed image data is annotated by assigning metadata in the form of classes or keypoints to the image. Fourthly, the selected deep learning (DL) model is trained. Within the next step, the model is analyzed using test data. Usually, test data is excluded from model training to ensure the validity of the trained model. Additionally, key performance indicators are used to evaluate the performance.

Fig. 7.2 Comparison of real-world dataset iterations vs. synthetic dataset iterations.

Fig. 7.3 Synthetic training data generation pipeline (based on [5]).

If the defined key performance indicators indicate poor results, the training parameters may need to be improved or more training data may be needed. Lastly, if the key performance indicators reach the determined quality level, the model can be exported and used for deployment.

The steps of data acquisition, data cleaning and data annotation are the most time-consuming and cost-intensive steps in the implementation of DL algorithms [3]. Synthetic data can bypass these steps (see Fig. 7.3) and reduce the implementation time significantly. Additionally, generating synthetic training data does not require the physical object, which enables fast data generation for multiple tasks. The domain gap can be bridged with synthetic training data by simply cutting out relevant objects from real images and mapping them onto randomly selected backgrounds, as shown in [11]. This approach bridges the domain gap because the composited images originate from the real domain. Its drawback is clearly that it cannot generate synthetic training images from different perspectives and lighting conditions, which is a limitation.
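A minimal sketch of this cut-and-paste composition is given below, assuming that object crops with an alpha mask are available; all file names and value ranges are illustrative, not taken from [11].

```python
# Cut-and-paste augmentation sketch: an object crop with alpha mask is
# composited onto randomly selected real backgrounds.
import random
from pathlib import Path
from PIL import Image

backgrounds = list(Path("backgrounds").glob("*.jpg"))  # real background images
obj = Image.open("part_rgba.png")                      # object crop with alpha mask

def synth_sample() -> Image.Image:
    bg = Image.open(random.choice(backgrounds)).convert("RGB").resize((640, 480))
    scale = random.uniform(0.5, 1.5)                   # random object size
    o = obj.resize((int(obj.width * scale), int(obj.height * scale)))
    o = o.rotate(random.uniform(0, 360), expand=True)  # random in-plane rotation
    x = random.randint(0, max(0, bg.width - o.width))  # random position
    y = random.randint(0, max(0, bg.height - o.height))
    bg.paste(o, (x, y), o)                             # alpha channel acts as mask
    return bg
```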

Domain Randomization (DR) was introduced in [4], enabling synthetic images from different perspectives and lighting conditions. This approach randomly creates a 3D environment using a variety of textures, numbers of light sources, colors, backgrounds and foreground objects. The aim of DR is to bridge the domain gap using a sufficient number of variations. The drawbacks of DR are that large amounts of training data are needed and that networks are unable to correctly identify small differences within similar classes.

The Unity Perception package introduced by [5] builds upon DR, enabling customizable image data generation out of the box, including ground truth annotations. The package supports 2D/3D object detection, semantic segmentation, instance segmentation and keypoint estimation. Several settings make it possible to generate millions of annotated images.

3 Synthetic Training Data Generation for Machine Vision Quality Gates

Within this work, the use of synthetic training data in the context of machine vision quality gates is explored. Therefore, a pipeline (see Fig. 7.3) is introduced using the Unity Perception package to generate synthetic training data in the context of machine vision quality gates in mass customization.

Fig. 7.4 Assembly stations equipped with graphical user interface and machine vision quality gate.

3.1 Synthetic Data Generation Pipeline for Classification

Within this section, the pipeline for creating synthetic annotated training data for classification is explained in detail. The basis for the pipeline is Domain Randomization. Creating synthetic annotated training datasets for classification tasks requires five basic steps (see Fig. 7.3): (1) generate/collect CAD data, (2) convert the data format, (3) import the data, (4) build the environment, and (5) set parameters and generate image data.

The first step to create synthetic image data is to generate (e.g., using Solidworks) or collect CAD data. In the context of industrial production, CAD data is usually created during product creation. The required level of detail of the CAD data depends upon the subsequent application for which the image data will be used. Missing individual parts inevitably lead to failures in bridging the domain gap.

After the CAD data is acquired, the data format needs to be converted into Unity-friendly formats, for example Filmbox (.fbx). Software tools able to convert STEP to Filmbox are Autodesk 3ds Max or Blender. Subsequently, the converted Filmbox data is imported into the Unity software environment. A batch conversion can be scripted, for example, as sketched below.
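The following sketch uses Blender's Python API for the conversion step, assuming Blender 3.x. Since STEP import is not built into Blender, the sketch assumes the parts were first exported from the CAD tool as STL; all paths are illustrative.

```python
# Batch conversion sketch for Blender 3.x; run via:
#   blender --background --python convert_to_fbx.py
# Assumes CAD parts were exported as STL beforehand (STEP import
# is not available in vanilla Blender).
import pathlib
import bpy

SRC = pathlib.Path("cad_exports")   # folder with .stl part files (assumption)
DST = pathlib.Path("unity_assets")  # output folder for .fbx files (assumption)
DST.mkdir(exist_ok=True)

for stl in SRC.glob("*.stl"):
    # start from an empty scene for each part
    bpy.ops.wm.read_factory_settings(use_empty=True)
    bpy.ops.import_mesh.stl(filepath=str(stl))
    bpy.ops.export_scene.fbx(filepath=str(DST / (stl.stem + ".fbx")))
```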

In the fourth step, the 3D environment is created. For this, the Unity Perception package is used [5]. To create a 3D environment capable of bridging the domain gap, several aspects need to be considered. According to [5], the 3D environment needs to contain the classified object, background noise objects, occluding noise objects and randomized object textures. These settings are programmed within the Unity Perception package. The result is presented in Fig. 7.3. Three layers of objects create the necessary variation in the synthetic environment.

Lastly, the environmental parameters of randomized object color, lighting parameters, camera pose and camera movement are programmed. As soon as the 3D environment is set, 2D images can be captured from the synthetic 3D scene. Additionally, each captured image is labeled automatically. The kind of parameter randomization performed for each image is illustrated below.
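Conceptually, each generated image corresponds to one random draw of these scene parameters. The following Python sketch illustrates the idea only; the actual randomizers are implemented in C# within the Unity Perception package, so all names and value ranges here are assumptions for illustration.

```python
# Illustrative domain randomization sampler: one parameter draw per image.
import random
from dataclasses import dataclass

@dataclass
class SceneParameters:
    object_hue: float         # randomized object color
    light_intensity: float    # lighting parameters
    light_angle_deg: float
    camera_distance_m: float  # camera pose
    camera_yaw_deg: float     # camera movement

def sample_scene() -> SceneParameters:
    return SceneParameters(
        object_hue=random.uniform(0.0, 1.0),
        light_intensity=random.uniform(0.5, 2.0),
        light_angle_deg=random.uniform(0.0, 360.0),
        camera_distance_m=random.uniform(0.3, 1.0),
        camera_yaw_deg=random.uniform(-45.0, 45.0),
    )

# Each sampled parameter set yields one rendered, automatically labeled image.
```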

4 Validation

Within this section, detailed experiment settings and results are reported. Furthermore, the validation settings used to train deep neural networks in the context of production are documented. Finally, the chapter summarizes the validation results.

4.1 Smart Automation Laboratory

The Smart Automation Laboratory located at the Heinz Nixdorf Institute (Paderborn, Germany) is used for research in the fields of production planning and control as well as automation technology, such as machine vision quality gates. The laboratory consists of three manufacturing cells, a material flow system and an adaptive assembly workstation (see Fig. 7.4). The manufacturing cells and the assembly workstation are connected via the rail-bound material flow system. Shuttles are used on the rails, which receive orders from the ERP system and process them in communication with the manufacturing cells and the assembly station [12, 13]. Currently, the open-source robot arm Zortrax is manufactured at the laboratory; it consists of 34 separate parts and requires 29 assembly steps [13].

The assembly workstation is equipped with boxes for parts, a camera including a lighting source and an assistant system. The assistant system visualizes the status, assembly description, required parts, an animated video and a real-time camera stream (see Fig. 7.4). The real-time camera stream is used to evaluate the process and the quality of the assembly. The classification is done using deep neural networks, which were initially trained using real image data. Within this research, these deep neural networks are trained using synthetic training images and validated using real image data.

4.2 Validation Setting

Within the validation, the aim was to correctly classify one sequential assembly step of the robot arm Zortrax (see Fig. 7.5): assembly step 12 was to be evaluated using synthetically trained DL models.

Fig. 7.5 Classification problem of assembly step 12.

As shown in Fig. 7.5, the assembly step requires various parts and includes several work steps. First, the ring needs to be placed under the carrier. Subsequently, the tooth ring needs to be placed under the carrier and the ring for the assembly to be correct. Forgetting the ring between the carrier and the tooth ring leads to a quality deficiency (wrong assembly, see Fig. 7.5). If the parts are placed in the proper order, the carrier is attached to the tooth ring using eight screws and nuts (correct assembly, see Fig. 7.5). This results in a binary classification problem for correctly supervising assembly step 12.

Three datasets were acquired and annotated (see Table 7.1). All trained DL models were evaluated using the same real test and validation image data. The first of the three training datasets contains 1000 real images per class of assembly step 12, captured using a standard camera and ring illumination. The second training dataset contains purely synthetically generated image data created using the previously explained pipeline. The last dataset is a hybrid dataset containing 50 real images and 950 synthetically generated images per class. This composition is based on the results of [3], which found that even small percentages of real image data within synthetically generated datasets boost the performance of DL algorithms. The hybrid split can be assembled, for example, as sketched below.
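For illustration, the hybrid training set can be assembled by sampling file lists per class. This is a sketch only; the directory layout and class names are assumptions, while the 50/950 split follows the dataset description above.

```python
# Assemble the hybrid training set: 50 real + 950 synthetic images per class
# (i.e., 5% real / 95% synthetic). Directory names are assumptions.
import random
from pathlib import Path

def hybrid_split(cls: str, n_real: int = 50, n_synth: int = 950) -> list:
    real = sorted(Path(f"real/{cls}").glob("*.png"))
    synth = sorted(Path(f"synthetic/{cls}").glob("*.png"))
    rng = random.Random(42)  # fixed seed for a reproducible selection
    return rng.sample(real, n_real) + rng.sample(synth, n_synth)

# class names "correct"/"wrong" are illustrative for the binary problem
train_files = hybrid_split("correct") + hybrid_split("wrong")
```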

Tab. 7.1 Summary of training, test and validation datasets.

Within this work, three DL algorithms were trained for classification: (1) DenseNet 201 [14], (2) ResNet 152 v2 [15] and (3) Xception [16]. Each model was trained using the above-listed training, test and validation datasets with the target size 224 × 224. All models were trained with the Adam optimizer, a learning rate of 0.001 and 10 epochs. Table 7.1 summarizes the distribution of image data used for training. A sketch of this training setup is given below.
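The following Keras sketch reproduces the stated setup (the three backbones, 224 × 224 inputs, Adam with learning rate 0.001, 10 epochs). The classification head, rescaling and directory layout are assumptions, as the paper does not detail them.

```python
# Training sketch for the three backbones with the stated hyperparameters.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import DenseNet201, ResNet152V2, Xception

def build(backbone_cls) -> tf.keras.Model:
    inputs = layers.Input(shape=(224, 224, 3))
    x = layers.Rescaling(1.0 / 255)(inputs)  # simple [0, 1] rescaling (assumption)
    backbone = backbone_cls(include_top=False, weights="imagenet",
                            input_shape=(224, 224, 3), pooling="avg")
    outputs = layers.Dense(1, activation="sigmoid")(backbone(x))  # correct vs. wrong
    model = models.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

# directory names are assumptions; one subfolder per class is expected
train_ds = tf.keras.utils.image_dataset_from_directory(
    "train", image_size=(224, 224), batch_size=32, label_mode="binary")
val_ds = tf.keras.utils.image_dataset_from_directory(
    "validation", image_size=(224, 224), batch_size=32, label_mode="binary")

for backbone_cls in (DenseNet201, ResNet152V2, Xception):
    model = build(backbone_cls)
    model.fit(train_ds, validation_data=val_ds, epochs=10)
```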

4.3 Key Performance Indicators

Within the validation, the trained DL models were evaluated using key performance indicators, which serve as measures for comparison and validation. The most common metrics for supervised learning are summarized in Table 7.2.

Tab. 7.2 Used key performance indicators (based on [2])
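For illustration, the four indicators used here (Accuracy, Precision, Recall, F1-Score) can be computed from predictions, e.g., with scikit-learn. The paper does not specify the tooling, and the label vectors below are purely illustrative.

```python
# Compute the key performance indicators for a binary classifier.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]  # ground truth labels (illustrative)
y_pred = [1, 0, 1, 0, 0, 1]  # model predictions (illustrative)

print("Accuracy :", accuracy_score(y_true, y_pred))   # (TP+TN) / all samples
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1-Score :", f1_score(y_true, y_pred))         # harmonic mean of P and R
```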

4.4 Results & Discussion

In this section, the performance of DL models trained using synthetic training data is compared to that of models trained on real-world data. Table 7.3 displays the results for each of the trained DL models. The results show that training on synthetic data can generally bridge the domain gap: the quality of assembly step 12 can be inspected with moderate success using synthetically generated training data.

Tab. 7.3 Summary of results after 10 training epochs.

The findings support the thesis that synthetically generated training data can be used for machine vision quality gates in production. Synthetically generated training data created using Domain Randomization and the Unity Perception package provides the ability to quickly generate training datasets. The outcome offers many advantages for automated quality control, especially in mass customization with a high number of variants, since unlimited amounts of training data can be produced.

However, compared to the baseline, it can be seen that a slight domain gap still exists between training with real and synthetic data. To further bridge the domain gap, improvements need to be made to the synthetic data generation pipeline as well as to the training settings. Additionally, as supported by [5], more training data might be needed when training with synthetic datasets; within their approach, [5] generated several thousand images per class.

Lastly, the evidence shows that using small amounts of real data (third training dataset) already improves the performance of the DL models significantly. The results therefore also support the thesis claimed by [5, 7] that small shares (5%–10%) of real data boost the performance of trained DL models.

5 Summary & Future Studies

The results show that synthetically generated training datasets are fundamentally suitable for training machine vision quality gates. This offers great potential to relieve process and production developers in the development of quality gates in the future. Synthetically generated image data seems to enable automatic quality controls for a high number of product variants, especially in the field of mass customization.

To conclude, domain randomization using the Unity Perception package to create synthetic image data is a promising research direction to be further examined. This paper presents a classification model that inspects one assembly step of the robotic arm Zortrax with high accuracy, trained on synthetic, hybrid and real image data. The trained models enable early error detection in the assembly process and are therefore able to ensure a high quality in the assembly. Since the chosen example is still a very simple binary classification problem, the results show very good key performance indicators, as expected. In order to further bridge the domain gap, improvements might be needed.

In future studies, the pipeline to create synthetic image data needs to be further improved. Settings such as background and foreground objects, textures and different parameter configurations need to be further examined; from this, an increase in performance and generalization is to be expected. Future work will explore how to make this technique reliable and effective for use in industrial production.